7 research outputs found

    High Performance MP2 for Condensed Phase Simulations

    This report describes the results of a PRACE Preparatory Access Type C project to optimise the implementation of second-order Møller–Plesset perturbation theory (MP2) in CP2K, so that it can be used efficiently on the PRACE Research Infrastructure. The work consisted of three stages: first, serial optimisation of several key computational kernels; second, an OpenMP implementation of the parallel 3D Fourier transform to support mixed-mode MPI/OpenMP use of CP2K; and third, benchmarking the performance gains achieved by the new code on HERMIT for a test case representative of proposed production simulations. Consistent speedups of 8% were achieved in the integration kernel routines as a result of the serial optimisation. When using 8 OpenMP threads per MPI process, speedups of up to 10x were achieved for the 3D FFT, and for some combinations of MPI processes and OpenMP threads, overall speedups of 66% were measured for the whole code. As a result of this work, a proposal for full PRACE Project Access has been submitted.

    Introducing Parallelism to the Ranges TS

    The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded system-on-a-chip devices. In this paper, we present a summary of why we believe the Ranges TS solves these problems and also improves both programmability and performance on heterogeneous platforms. The complete paper has been submitted to WG21 for consideration, and here we present a summary of the proposed changes alongside new performance results. To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we focus on the composability of functions and the benefit that this brings to accelerator devices via kernel fusion.

    Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs

    In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in adapting the code for this operation in ScaLAPACK to SMPSs lie in algorithmic restrictions and the semantics of the SMPSs programming model, but also that both can be overcome with a limited programming effort. The experimental results report considerable gains in performance and scalability of the routine parallelized with SMPSs when compared with conventional approaches to executing the original ScaLAPACK implementation in parallel, as well as with two recent message-passing routines for this operation. In summary, our study opens the door to reusing message-passing legacy codes/libraries for linear algebra by introducing up-to-date techniques, such as dynamic out-of-order scheduling, that significantly upgrade their performance while avoiding a costly rewrite/reimplementation.

    This research was supported by Project EU INFRA-2010-1.2.2 "TEXT: Towards EXaflop applicaTions". The researcher at BSC-CNS was supported by the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the Spanish Ministry of Education (CICYT TIN2011-23283, TIN2007-60625 and CSD2007-00050), and the Generalitat de Catalunya (2009-SGR-980). The researcher at CIMNE was partially funded by the UPC postdoctoral grants under the programme "BKC5-Atracció i Fidelització de talent al BKC". The researcher at UJI was supported by project CICYT TIN2008-06570-C04-01 and FEDER. We thank Jesus Labarta, from BSC-CNS, for helpful discussions on SMPSs and his help with the performance analysis of the codes with Paraver. We thank Vladimir Marjanovic, also from BSC-CNS, for his help in the set-up and tuning of the MPI/SMPSs tools on JuRoPa. Finally, we thank Rafael Mayo, from UJI, for his support in the preliminary stages of this work. The authors gratefully acknowledge the computing time granted on the supercomputer JuRoPa at Jülich Supercomputing Centre.
    Peer Reviewed. Preprint.
